ABSTRACT

Summary: RNA interference (RNAi) is known to play an important part 
in defence against viruses in a range of species. Second-generation 
sequencing technologies allow us to assay these systems and the 
small RNAs that play a key role with unprecedented depth. 
However, scientists need access to tools that can condense, analyse 
and display the resulting data. Here, we present viRome, a package 
for R that takes aligned sequence data and produces a range of 
essential plots and reports. 

Availability and implementation: viRome is released under the BSD
license as a package for R, available for both Windows and Linux at
http://virome.sf.net. Additional information and a tutorial are
available on the ARK-Genomics website:
http://www.ark-genomics.org/bioinformatics/virome.
Contact: mick.watson@roslin.ed.ac.uk 
1 INTRODUCTION

RNA interference (RNAi) is mediated by small RNAs, such as
microRNAs (miRNAs) of 21-22 nt (Lagos-Quintana et al., 2001),
small interfering RNAs (siRNAs) of 21-22 nt (Bernstein et al.,
2001; Zamore et al., 2000) and PIWI-interacting RNAs (piRNAs)
of 24-30 nt (Aravin et al., 2003; Brennecke et al., 2007), and these
molecules regulate many biological processes. These pathways are
also a major part of the antiviral response in both insects and
plants, including the response to viruses that cause important
mosquito-borne diseases of humans and animals, such as West Nile
Virus, Dengue Virus and Chikungunya Virus. In arthropods, this
response is characterized by the production of 21-22 nt
virus-derived small interfering RNAs (viRNAs) or 24-30 nt viral
piRNA-like molecules (Blair, 2011; Donald et al., 2012; Myles et al.,
2009).

Second-generation sequencing allows scientists to assay these 
systems in unprecedented depth, and short reads capture both 
the 21-22 nt siRNAs and the 24-30 nt piRNAs. However, there 
is a need for scientists to be able to summarize, analyse and 
visualize the results of such experiments. Here, we present 
viRome, a package for R, which takes aligned sequencing data 
in the BAM format (Li et al., 2009) and produces a variety of
plots and reports that are essential to the analysis of data from 
viral siRNA datasets. 

Software packages to analyse viral siRNA data already exist.
Paparazzi (Vodovar et al., 2011) is designed to reconstruct viral
genomes from siRNA data and produces some plots similar to those of
viRome. Visitor (Antoniewski, 2011), an informatic pipeline for
analysing short-read viRNA data, also produces several similar
plots. However, both are implemented in Perl and are limited to
Linux/Unix operating systems. Both include alignment as part of the
analysis, so using an alternative aligner would require programming
skills. Finally, the plots are generated in batch mode, so there is
no interaction between the user and the software.

As a package for R, viRome improves on these software packages in
several ways: (i) viRome allows interaction between the user and the
software during report and graph generation, (ii) viRome is
available on any operating system that supports R and has been
tested on Microsoft Windows and several Linux distributions,
(iii) viRome separates visualization from alignment; therefore, the
user is free to use any alignment software they wish and (iv) as an
R package, viRome integrates seamlessly with other R packages from
the Bioconductor project (Gentleman et al., 2004).



2 ANALYSIS AND VISUALIZATION 

As input, viRome takes aligned sequence data in the BAM format.
Many tools exist for alignment (Fonseca et al., 2012) and, provided
they support the SAM/BAM format, viRome is capable of working with
their output. Many of the functions within viRome attempt to
summarize millions of data points into tables and plots that allow
biological interpretation. One of the benefits of viRome is that
most functions return the summarized data, as well as creating a
plot. This allows users to create their own plots if they wish.
Figure 1 shows a selection of plots produced by viRome.
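To make the input concrete, the following is a minimal sketch in Python (not viRome's R code) of how aligned reads might be pulled out of SAM text, the plain-text equivalent of BAM. The field order and the FLAG bits 0x4 (unmapped) and 0x10 (reverse strand) come from the SAM format; the names `AlignedRead`, `reverse_complement` and `parse_sam` are hypothetical and introduced only for illustration. Because SAM stores the sequence of a reverse-strand read as its reverse complement, the sketch converts it back to the read's own 5′ to 3′ orientation.

```python
from typing import Iterable, List, NamedTuple


class AlignedRead(NamedTuple):
    """Hypothetical container for one mapped read (illustration only)."""
    ref: str     # reference sequence name (SAM RNAME)
    pos: int     # 1-based leftmost mapped position (SAM POS)
    strand: str  # "+" or "-"
    seq: str     # read sequence in its own 5'->3' orientation


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_sam(lines: Iterable[str]) -> List[AlignedRead]:
    """Collect mapped reads from SAM-formatted text lines."""
    reads = []
    for line in lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:  # read is unmapped
            continue
        strand = "-" if flag & 0x10 else "+"
        seq = fields[9]
        if strand == "-":
            # SAM reports reverse-strand reads on the forward strand.
            seq = reverse_complement(seq)
        reads.append(AlignedRead(fields[2], int(fields[3]), strand, seq))
    return reads


sam_lines = [
    "@HD\tVN:1.6",
    "r1\t0\tvirus\t5\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tvirus\t7\t255\t4M\t*\t0\t0\tAACC\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGG\tIIII",
]
reads = parse_sam(sam_lines)
```

In this toy input, one read maps to the forward strand, one to the reverse strand and one is unmapped, so only two records are retained.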

Fig. 1. Clockwise from top-left: a plot of read-length distribution; genomic location of 21-22 nt reads; genomic location of 25-29 nt reads; heatmap and
sequence logo showing T1 bias; heatmap and sequence logo showing A10 bias; barplot showing T1 bias; 5′ read distance plot for 25-29 nt reads showing
enrichment of 10 nt overlap; and a heatmap showing the genomic location of 18-36 bp reads (counts per position: black is low, red is high)

Global analyses: One of the first requirements is to plot a
histogram of the lengths of mapped reads: a peak at 21-22 nt
implies an siRNA response, whereas a high frequency of 24-30 nt
reads with a peak at 28 nt implies a piRNA response. In viRome,
this histogram can be created using the barplot.bam function. Users
may also create a report using the sequence.report function. This
produces a data.frame in R that summarizes and counts the sequences
aligned to each base in a given reference sequence. Users can see
the exact sequence, its length, the location and strand of the
alignment plus a count of how many times that sequence occurs. As a
data.frame, this can be easily exported to Excel or other
spreadsheet software.
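The two global summaries described above can be sketched as follows. This is an illustrative Python sketch of the underlying tallies, not the implementation of barplot.bam or sequence.report; the function names `length_distribution` and `sequence_report`, and the representation of a read as a `(position, strand, sequence)` tuple, are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A read is assumed to be (1-based position, strand, sequence).
Read = Tuple[int, str, str]


def length_distribution(reads: List[Read]) -> Counter:
    """Count reads per (length, strand); the basis of a length histogram."""
    counts = Counter()
    for _pos, strand, seq in reads:
        counts[(len(seq), strand)] += 1
    return counts


def sequence_report(reads: List[Read]) -> List[Dict[str, object]]:
    """One row per distinct (position, strand, sequence) with its count."""
    counts = Counter(reads)
    rows = [
        {
            "position": pos,
            "strand": strand,
            "sequence": seq,
            "length": len(seq),
            "count": n,
        }
        for (pos, strand, seq), n in counts.items()
    ]
    # Order rows along the genome so the table reads left to right.
    rows.sort(key=lambda r: (r["position"], r["strand"], r["sequence"]))
    return rows


example_reads = [
    (10, "+", "A" * 21),
    (10, "+", "A" * 21),
    (12, "+", "G" * 22),
    (40, "-", "C" * 28),
]
lengths = length_distribution(example_reads)
report = sequence_report(example_reads)
```

Each row of `report` corresponds to one line of the tabular report described in the text; writing the rows out with the standard `csv` module would give a file that spreadsheet software can open.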

Location-based analyses: Although many viruses are targeted
by the siRNA pathway throughout the genome, others are targeted
only in limited regions (Sabin et al., 2013). A heatmap
representing the occurrence of all mapped read lengths
across all genomic locations can be produced using the
size.position.heatmap function, and barplots showing counts for
each genomic location for each read length can be generated using
the stacked.barplot function.
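The data behind such a heatmap is a matrix of counts indexed by read length and genomic position. The sketch below, in Python, shows one way to build it; it is not the code of size.position.heatmap. It assumes that each read is counted once at its leftmost mapped position (counting every covered base would be an equally plausible choice), and the name `size_position_matrix` and the default 18-36 nt length range, taken from the Figure 1 caption, are choices made for illustration.

```python
from typing import Dict, List, Tuple


def size_position_matrix(
    reads: List[Tuple[int, int]],
    genome_length: int,
    min_len: int = 18,
    max_len: int = 36,
) -> Dict[int, List[int]]:
    """Return {read length: per-position counts} for the given reads.

    Each read is (1-based leftmost position, read length). Reads whose
    length or position falls outside the requested ranges are ignored.
    """
    matrix = {
        length: [0] * genome_length
        for length in range(min_len, max_len + 1)
    }
    for pos, length in reads:
        if length in matrix and 1 <= pos <= genome_length:
            matrix[length][pos - 1] += 1
    return matrix


example = [(3, 21), (3, 21), (5, 28), (7, 40)]
matrix = size_position_matrix(example, genome_length=10)
```

Each row of the resulting matrix (one per read length) corresponds to one row of the heatmap, and the same rows stacked per position give the counts needed for a stacked barplot.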

Read-based analyses: Read-based analyses allow users to focus
on patterns in particular subsets of reads. Single barplots
showing the location, strand and count of reads mapping throughout
the genome can be produced using the position.barplot
function. The base composition of subsets of reads can be
calculated with the make.pwm function. Sequence signatures of the
piRNA pathway in Drosophila include a strong U1 bias in primary,
antisense piRNAs and, following 'ping-pong' cycle amplification
involving AGO3 and Aub, a strong A10 bias in secondary sense piRNAs
(Brennecke et al., 2007). Similar motifs have been
found in piRNAs and viral piRNA-like molecules in mosquitoes
or derived cell lines (Morazzani et al., 2012; Schnettler et al.,
2013; Vodovar et al., 2012). The output of make.pwm can be
plotted as a heatmap using the pwm.heatmap function, or used
with external packages such as seqLogo and motifStack to produce
sequence logos. Finally, the 5′-ends of complementary
piRNAs are most frequently separated by 10 nt (Brennecke
et al., 2007; Vodovar et al., 2012) because of the 'ping-pong'
amplification described above. The distance between 5′-ends of
piRNAs mapping to opposite strands can be summarized and
visualized using the read.dist.plot function.
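The two read-based summaries discussed above, per-position base composition and the spacing of 5′-ends on opposite strands, can be sketched in Python as follows. This is an illustration of the calculations, not the code of make.pwm or read.dist.plot. The names `make_pwm`, `five_prime_end` and `overlap_histogram` are hypothetical, alignments are assumed to be ungapped, and the spacing is expressed as the overlap in nt between a forward-strand read and a reverse-strand read; viRome may define the distance differently.

```python
from collections import Counter
from typing import Dict, List, Tuple

BASES = "ACGT"


def make_pwm(seqs: List[str], width: int) -> List[Dict[str, float]]:
    """Per-position base frequencies over the first `width` bases.

    Sequences are assumed to be in their own 5'->3' orientation, so
    index 0 is the first base of the read (the U1/T1 position).
    """
    counts = [Counter() for _ in range(width)]
    for seq in seqs:
        for i, base in enumerate(seq[:width]):
            if base in BASES:
                counts[i][base] += 1
    pwm = []
    for column in counts:
        total = sum(column.values())
        pwm.append({b: (column[b] / total if total else 0.0) for b in BASES})
    return pwm


def five_prime_end(pos: int, strand: str, length: int) -> int:
    """Genomic coordinate of the read's 5' end (ungapped alignment)."""
    return pos if strand == "+" else pos + length - 1


def overlap_histogram(
    reads: List[Tuple[int, str, int]], max_overlap: int = 30
) -> Counter:
    """Count +/- read pairs by the overlap implied by their 5' ends.

    Each read is (1-based leftmost position, strand, length). A plus
    read with 5' end p and a minus read with 5' end q overlap by
    q - p + 1 nt; ping-pong pairs are expected at an overlap of 10.
    """
    plus, minus = Counter(), Counter()
    for pos, strand, length in reads:
        target = plus if strand == "+" else minus
        target[five_prime_end(pos, strand, length)] += 1
    hist = Counter()
    for p, n_plus in plus.items():
        for overlap in range(1, max_overlap + 1):
            q = p + overlap - 1
            if q in minus:
                hist[overlap] += n_plus * minus[q]
    return hist


pwm = make_pwm(["TAAA", "TCCA", "TGGA", "AGGA"], width=4)
hist = overlap_histogram(
    [(100, "+", 28), (100, "+", 27), (82, "-", 28), (200, "-", 25)]
)
```

In the toy data, the reverse-strand read starting at position 82 has its 5′-end at position 109, so it overlaps each of the two forward-strand reads starting at position 100 by 10 nt.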



3 CONCLUSIONS 

Deep sequencing experiments have revealed a variety of interesting
and unique signatures of the miRNA, siRNA and piRNA
pathways, and there is a need for software that allows scientists
to process such data. We have developed viRome, a package for
R that allows the interactive generation of a range of informative
plots and reports. As an R package, viRome is available on
a range of operating systems. viRome is released under an
open-source license and can be downloaded from
http://virome.sf.net, where a tutorial is also available.